Model interpretation study

Using the DALEX library

Author
Affiliation

Create models

Code
import dalex as dx
titanic = dx.datasets.load_titanic()
X = titanic.drop(columns='survived')
y = titanic.survived
Code
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

preprocess = make_column_transformer(
    (StandardScaler(), ['age', 'fare', 'parch', 'sibsp']),
    (OneHotEncoder(), ['gender', 'class', 'embarked']))

Logistic regression model

Code
from sklearn.linear_model import LogisticRegression

titanic_lr = make_pipeline(
    preprocess,
    LogisticRegression(penalty = 'l2'))
titanic_lr.fit(X, y)
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('standardscaler',
                                                  StandardScaler(),
                                                  ['age', 'fare', 'parch',
                                                   'sibsp']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(),
                                                  ['gender', 'class',
                                                   'embarked'])])),
                ('logisticregression', LogisticRegression())])

Random forest model

Code
from sklearn.ensemble import RandomForestClassifier

titanic_rf = make_pipeline(
    preprocess,
    RandomForestClassifier(max_depth = 3, n_estimators = 500))
titanic_rf.fit(X, y)
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('standardscaler',
                                                  StandardScaler(),
                                                  ['age', 'fare', 'parch',
                                                   'sibsp']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(),
                                                  ['gender', 'class',
                                                   'embarked'])])),
                ('randomforestclassifier',
                 RandomForestClassifier(max_depth=3, n_estimators=500))])

Gradient boosting model

Code
from sklearn.ensemble import GradientBoostingClassifier

titanic_gbc = make_pipeline(
    preprocess,
    GradientBoostingClassifier(n_estimators = 100))
titanic_gbc.fit(X, y)
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('standardscaler',
                                                  StandardScaler(),
                                                  ['age', 'fare', 'parch',
                                                   'sibsp']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(),
                                                  ['gender', 'class',
                                                   'embarked'])])),
                ('gradientboostingclassifier', GradientBoostingClassifier())])

Support vector machine model

Code
from sklearn.svm import SVC

titanic_svm = make_pipeline(
    preprocess,
    SVC(probability = True))
titanic_svm.fit(X, y)
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('standardscaler',
                                                  StandardScaler(),
                                                  ['age', 'fare', 'parch',
                                                   'sibsp']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(),
                                                  ['gender', 'class',
                                                   'embarked'])])),
                ('svc', SVC(probability=True))])

Models’ predictions

Code
import pandas as pd

johnny_d = pd.DataFrame({'gender'  : ['male'],
                         'age'     : [8],
                         'class'   : ['1st'],
                         'embarked': ['Southampton'],
                         'fare'    : [72],
                         'sibsp'   : [0],
                         'parch'   : [0]},
                        index = ['JohnnyD'])

henry = pd.DataFrame({'gender'   : ['male'],
                       'age'     : [47],
                       'class'   : ['1st'],
                       'embarked': ['Cherbourg'],
                       'fare'    : [25],
                       'sibsp'   : [0],
                       'parch'   : [0]},
                      index = ['Henry'])

Instance Level

Break-down plots for additive attributions

Break-down plots answer the question: which variables contribute the most to this result?

Code
import dalex as dx
titanic_rf_exp = dx.Explainer(titanic_rf, X, y, 
                  label = "Titanic RF Pipeline")
Preparation of a new explainer is initiated

  -> data              : 2207 rows 7 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 2207 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : Titanic RF Pipeline
  -> predict function  : <function yhat_proba_default at 0x7fd2d838a8b0> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.174, mean = 0.322, max = 0.889
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.829, mean = -5.07e-05, max = 0.824
  -> model_info        : package sklearn

A new explainer has been created!
Code
bd_henry = titanic_rf_exp.predict_parts(henry, 
             type = 'break_down')
bd_henry.result
variable_name variable_value variable cumulative contribution sign position label
0 intercept 1 intercept 0.322207 0.322207 1.0 8 Titanic RF Pipeline
1 class 1st class = 1st 0.394536 0.072328 1.0 7 Titanic RF Pipeline
2 embarked Cherbourg embarked = Cherbourg 0.423858 0.029323 1.0 6 Titanic RF Pipeline
3 fare 25.0 fare = 25.0 0.429974 0.006116 1.0 5 Titanic RF Pipeline
4 sibsp 0.0 sibsp = 0.0 0.428781 -0.001193 -1.0 4 Titanic RF Pipeline
5 parch 0.0 parch = 0.0 0.423522 -0.005259 -1.0 3 Titanic RF Pipeline
6 age 47.0 age = 47.0 0.416755 -0.006766 -1.0 2 Titanic RF Pipeline
7 gender male gender = male 0.308259 -0.108496 -1.0 1 Titanic RF Pipeline
8 prediction 0.308259 0.308259 1.0 0 Titanic RF Pipeline
Code
bd_henry.plot()
Code
import numpy as np

bd_henry = titanic_rf_exp.predict_parts(henry,
        type = 'break_down',
        order = np.array(['gender', 'class', 'age',
            'embarked', 'fare', 'sibsp', 'parch']))
bd_henry.plot(max_vars = 5)
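The mechanics behind a break-down table can be sketched with a toy model. Everything below (the scoring function, the data, the helper names) is illustrative, not dalex internals: the intercept is the mean prediction over the data, and each variable's contribution is the change in the mean prediction when that variable is fixed at the instance's value, in a chosen order.

```python
# Toy break-down: contributions are changes in the mean prediction
# as consecutive variables are fixed at the instance's values.

def model(x):
    # simple scoring function standing in for any classifier
    return 0.1 * x['a'] + 0.3 * x['b']

data = [{'a': 1, 'b': 0}, {'a': 3, 'b': 2}, {'a': 2, 'b': 4}]
instance = {'a': 3, 'b': 4}

def mean_pred(fixed):
    # average prediction with some variables fixed at given values
    return sum(model({**row, **fixed}) for row in data) / len(data)

order = ['b', 'a']
contributions = {}
fixed = {}
intercept = prev = mean_pred(fixed)   # mean prediction over the data
for v in order:
    fixed[v] = instance[v]
    cur = mean_pred(fixed)
    contributions[v] = cur - prev
    prev = cur

# intercept + sum of contributions equals the instance's prediction
assert abs(intercept + sum(contributions.values()) - model(instance)) < 1e-9
```

With this additive toy model the decomposition is exact; for real models the contributions generally depend on the chosen order, which is why the `order` argument above matters.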

Break-down plots for additive interactions

Interaction (deviation from additivity) means that the effect of an explanatory variable depends on the value(s) of other variable(s).
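A minimal numeric illustration of this (the multiplicative model and names below are toy assumptions, not dalex code): for f(x, z) = x·z, the break-down contribution of x differs depending on whether z has already been fixed, which is exactly the order dependence that interaction terms capture.

```python
# A multiplicative model exhibits interaction: the effect of x depends on z.
def model(r):
    return r['x'] * r['z']

data = [{'x': 0, 'z': 0}, {'x': 1, 'z': 0}, {'x': 0, 'z': 1}, {'x': 1, 'z': 1}]
inst = {'x': 1, 'z': 1}

def mean_pred(fixed):
    return sum(model({**r, **fixed}) for r in data) / len(data)

# contribution of x when fixed first vs. fixed after z
contrib_x_first   = mean_pred({'x': 1}) - mean_pred({})
contrib_x_after_z = mean_pred({'x': 1, 'z': 1}) - mean_pred({'z': 1})
assert contrib_x_first != contrib_x_after_z   # order matters => interaction
```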

Code
ibd_henry = titanic_rf_exp.predict_parts(henry, 
                type = 'break_down_interactions', 
                interaction_preference = 10)
ibd_henry.result
variable_name variable_value variable cumulative contribution sign position label
0 intercept 1 intercept 0.322207 0.322207 1.0 5 Titanic RF Pipeline
1 class:gender 1st:male class:gender = 1st:male 0.295512 -0.026696 -1.0 4 Titanic RF Pipeline
2 fare:embarked 25.0:Cherbourg fare:embarked = 25.0:Cherbourg 0.328518 0.033006 1.0 3 Titanic RF Pipeline
3 parch:sibsp 0.0:0.0 parch:sibsp = 0.0:0.0 0.318294 -0.010224 -1.0 2 Titanic RF Pipeline
4 age 47.0 age = 47.0 0.308259 -0.010035 -1.0 1 Titanic RF Pipeline
5 prediction 0.308259 0.308259 1.0 0 Titanic RF Pipeline
Code
ibd_henry.plot()

Shapley Additive Explanations (SHAP) for average attributions

Shapley values average a variable's attributions over many orderings of the explanatory variables, removing the influence of the ordering.
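For two variables this averaging can be computed exactly, as a toy sketch (the model and data are illustrative assumptions; dalex samples orderings rather than enumerating them for larger problems):

```python
from itertools import permutations

def model(r):
    return r['x'] * r['z']

data = [{'x': 0, 'z': 0}, {'x': 1, 'z': 0}, {'x': 0, 'z': 1}, {'x': 1, 'z': 1}]
inst = {'x': 1, 'z': 1}

def mean_pred(fixed):
    return sum(model({**r, **fixed}) for r in data) / len(data)

# average each variable's break-down contribution over all orderings
shap = {v: 0.0 for v in inst}
orders = list(permutations(inst))
for order in orders:
    fixed = {}
    prev = mean_pred(fixed)
    for v in order:
        fixed[v] = inst[v]
        cur = mean_pred(fixed)
        shap[v] += (cur - prev) / len(orders)
        prev = cur

# attributions sum to the prediction minus the average prediction
assert abs(sum(shap.values()) - (model(inst) - mean_pred({}))) < 1e-9
```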

Code
bd_henry = titanic_rf_exp.predict_parts(henry, type = 'shap')
bd_henry.result
variable contribution variable_name variable_value sign label B
0 embarked = Cherbourg 0.021287 embarked Cherbourg 1.0 Titanic RF Pipeline 1
1 sibsp = 0.0 0.000773 sibsp 0 1.0 Titanic RF Pipeline 1
2 gender = male -0.091533 gender male -1.0 Titanic RF Pipeline 1
3 age = 47.0 -0.005456 age 47 -1.0 Titanic RF Pipeline 1
4 parch = 0.0 -0.006794 parch 0 -1.0 Titanic RF Pipeline 1
... ... ... ... ... ... ... ...
2 embarked = Cherbourg 0.022800 embarked Cherbourg 1.0 Titanic RF Pipeline 0
3 parch = 0.0 -0.005639 parch 0 -1.0 Titanic RF Pipeline 0
4 age = 47.0 -0.005172 age 47 -1.0 Titanic RF Pipeline 0
5 fare = 25.0 0.004127 fare 25 1.0 Titanic RF Pipeline 0
6 sibsp = 0.0 -0.001006 sibsp 0 -1.0 Titanic RF Pipeline 0

182 rows × 7 columns

Code
bd_henry.plot()

Local Interpretable Model-agnostic Explanations (LIME)

Break-down (BD) plots and Shapley values are most suitable for models with a small or moderate number of explanatory variables. For models with many variables, sparse explainers are preferred; the most popular example is the Local Interpretable Model-agnostic Explanations (LIME) method and its modifications.

In the first step, we read the Titanic data and encode categorical variables. In this case, we use the simplest encoding for gender, class, and embarked, i.e., label-encoding.

Code
import dalex as dx

titanic = dx.datasets.load_titanic()
X = titanic.drop(columns='survived')
y = titanic.survived

from sklearn import preprocessing
le = preprocessing.LabelEncoder()

X['gender']   = le.fit_transform(X['gender'])
X['class']    = le.fit_transform(X['class'])
X['embarked'] = le.fit_transform(X['embarked'])

In the next step we train a random forest model.

Code
from sklearn.ensemble import RandomForestClassifier as rfc
titanic_fr = rfc()
titanic_fr.fit(X, y)
RandomForestClassifier()

It is time to define the observation for which the model prediction will be explained. We store Henry's data in a pandas.Series object.

Code
import pandas as pd
henry = pd.Series([1, 47.0, 0, 1, 25.0, 0, 0], 
                  index =['gender', 'age', 'class', 'embarked',
                          'fare', 'sibsp', 'parch']) 

The lime library explains models that operate on images, text, or tabular data. In the latter case, we use the LimeTabularExplainer class.

Code
from lime.lime_tabular import LimeTabularExplainer 
explainer = LimeTabularExplainer(X, 
                      feature_names=X.columns, 
                      class_names=['died', 'survived'], 
                      discretize_continuous=False, 
                      verbose=True)

The result is an explainer that can be used to interpret a model around specific observations. In the following example, we explain the behaviour of the model for Henry. The explain_instance() method finds a local approximation with an interpretable linear model. The result can be presented graphically with the show_in_notebook() method.

Code
lime = explainer.explain_instance(henry, titanic_fr.predict_proba)
lime.show_in_notebook(show_table=True)
Intercept 0.3565169105917955
Prediction_local [0.37588424]
Right: 0.28
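The core loop that LIME performs internally can be sketched in three steps: perturb the instance, weight the perturbations by proximity, and fit a simple surrogate model. The sketch below is a stdlib-only caricature with one feature and a hand-written weighted least-squares fit; the function and bandwidth are illustrative assumptions, not the lime library's actual implementation.

```python
import math
import random

random.seed(0)

def black_box(x):
    # stand-in for any classifier's predicted probability
    return 1 / (1 + math.exp(-(x - 2)))

x0 = 2.5  # the instance to explain

# 1) perturb around the instance
xs = [x0 + random.gauss(0, 1) for _ in range(200)]
ys = [black_box(x) for x in xs]
# 2) weight perturbations by proximity to the instance
ws = [math.exp(-(x - x0) ** 2) for x in xs]

# 3) fit a weighted linear surrogate y = a + b*x (closed form)
sw = sum(ws)
mx = sum(w * x for w, x in zip(ws, xs)) / sw
my = sum(w * y for w, y in zip(ws, ys)) / sw
b = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys)) \
    / sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
a = my - b * mx
```

The surrogate's coefficient `b` plays the role of the local explanation: it approximates how the black-box prediction changes near `x0`.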

Ceteris-paribus profiles

“Ceteris paribus” is a Latin phrase meaning “other things held constant” or “all else unchanged”.

Ceteris-paribus (CP) profiles show how a model's prediction would change if the value of a single explanatory variable changed.
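The construction is simple enough to sketch directly (the scoring function and grid below are toy assumptions, not what `predict_profile` computes internally): evaluate the model on copies of the instance where one variable sweeps over a grid while all other variables stay fixed.

```python
# Ceteris-paribus sketch: vary one variable over a grid, keep the rest fixed.
def model(r):
    return 0.5 * r['age'] + 2.0 * r['fare']   # toy scoring function

henry = {'age': 47, 'fare': 25}

grid = range(0, 81, 10)
cp = {a: model({**henry, 'age': a}) for a in grid}
# cp maps each grid value of 'age' to the prediction with 'fare' unchanged
```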

Code
import dalex as dx
titanic = dx.datasets.load_titanic()
X = titanic.drop(columns='survived')
y = titanic.survived
Code
from sklearn.ensemble import RandomForestClassifier

titanic_rf = make_pipeline(
    preprocess,
    RandomForestClassifier(max_depth = 3, n_estimators = 500))
titanic_rf.fit(X, y)
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('standardscaler',
                                                  StandardScaler(),
                                                  ['age', 'fare', 'parch',
                                                   'sibsp']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(),
                                                  ['gender', 'class',
                                                   'embarked'])])),
                ('randomforestclassifier',
                 RandomForestClassifier(max_depth=3, n_estimators=500))])
Code
henry = pd.DataFrame({'gender'   : ['male'],
                       'age'     : [47],
                       'class'   : ['1st'],
                       'embarked': ['Cherbourg'],
                       'fare'    : [25],
                       'sibsp'   : [0],
                       'parch'   : [0]},
                      index = ['Henry'])


import dalex as dx
titanic_rf_exp = dx.Explainer(titanic_rf, X, y, 
                  label = "Titanic RF Pipeline")
Preparation of a new explainer is initiated

  -> data              : 2207 rows 7 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 2207 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : Titanic RF Pipeline
  -> predict function  : <function yhat_proba_default at 0x7fd2d838a8b0> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.17, mean = 0.321, max = 0.901
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.834, mean = 0.0008, max = 0.827
  -> model_info        : package sklearn

A new explainer has been created!
Code
cp_henry = titanic_rf_exp.predict_profile(henry)
cp_henry.result
gender age class embarked fare sibsp parch _original_ _yhat_ _vname_ _ids_ _label_
Henry male 47.000000 1st Cherbourg 25.0 0.0 0.00 male 0.300351 gender Henry Titanic RF Pipeline
Henry female 47.000000 1st Cherbourg 25.0 0.0 0.00 male 0.822174 gender Henry Titanic RF Pipeline
Henry male 0.166667 1st Cherbourg 25.0 0.0 0.00 47 0.420320 age Henry Titanic RF Pipeline
Henry male 0.905000 1st Cherbourg 25.0 0.0 0.00 47 0.420320 age Henry Titanic RF Pipeline
Henry male 1.643333 1st Cherbourg 25.0 0.0 0.00 47 0.417190 age Henry Titanic RF Pipeline
... ... ... ... ... ... ... ... ... ... ... ... ...
Henry male 47.000000 1st Cherbourg 25.0 0.0 8.64 0 0.342079 parch Henry Titanic RF Pipeline
Henry male 47.000000 1st Cherbourg 25.0 0.0 8.73 0 0.342079 parch Henry Titanic RF Pipeline
Henry male 47.000000 1st Cherbourg 25.0 0.0 8.82 0 0.342079 parch Henry Titanic RF Pipeline
Henry male 47.000000 1st Cherbourg 25.0 0.0 8.91 0 0.342079 parch Henry Titanic RF Pipeline
Henry male 47.000000 1st Cherbourg 25.0 0.0 9.00 0 0.342079 parch Henry Titanic RF Pipeline

419 rows × 12 columns

Code
cp_henry.plot(variables = ['age', 'fare'])
Code
cp_henry.plot(variables = ['class', 'embarked'],
               variable_type = 'categorical')
Code
from sklearn.linear_model import LogisticRegression

titanic_lr = make_pipeline(
    preprocess,
    LogisticRegression(penalty = 'l2'))
titanic_lr.fit(X, y)
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('standardscaler',
                                                  StandardScaler(),
                                                  ['age', 'fare', 'parch',
                                                   'sibsp']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(),
                                                  ['gender', 'class',
                                                   'embarked'])])),
                ('logisticregression', LogisticRegression())])
Code
import dalex as dx
titanic_lr_exp = dx.Explainer(titanic_lr, X, y, 
                  label = "Titanic LR Pipeline")
Preparation of a new explainer is initiated

  -> data              : 2207 rows 7 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 2207 values
  -> model_class       : sklearn.linear_model._logistic.LogisticRegression (default)
  -> label             : Titanic LR Pipeline
  -> predict function  : <function yhat_proba_default at 0x7fd2d838a8b0> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.009, mean = 0.322, max = 0.97
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.96, mean = -5.83e-07, max = 0.964
  -> model_info        : package sklearn

A new explainer has been created!
Code
cp_henry2 = titanic_lr_exp.predict_profile(henry)
cp_henry.plot(cp_henry2, variables = ['age', 'fare'])

Dataset Level

Model-performance Measures

Code
import dalex as dx
titanic_rf_exp = dx.Explainer(titanic_rf, X, y, 
                  label = "Titanic RF Pipeline")
Preparation of a new explainer is initiated

  -> data              : 2207 rows 7 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 2207 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : Titanic RF Pipeline
  -> predict function  : <function yhat_proba_default at 0x7fd2d838a8b0> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.17, mean = 0.321, max = 0.901
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.834, mean = 0.0008, max = 0.827
  -> model_info        : package sklearn

A new explainer has been created!
Code
mp_rf = titanic_rf_exp.model_performance(model_type = "classification", 
          cutoff = 0.5)
mp_rf.result
recall precision f1 accuracy auc
Titanic RF Pipeline 0.500703 0.765591 0.605442 0.78976 0.804917
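The AUC reported above has a useful probabilistic reading: it is the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. A stdlib-only sketch on toy scores (not the model's actual predictions):

```python
# AUC as a rank statistic: P(score of a positive > score of a negative),
# counting ties as 1/2.
y_true  = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

pos = [s for s, t in zip(y_score, y_true) if t == 1]
neg = [s for s, t in zip(y_score, y_true) if t == 0]
pairs = [(p, n) for p in pos for n in neg]
auc = sum((p > n) + 0.5 * (p == n) for p, n in pairs) / len(pairs)
```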
Code
import plotly.express as px
from sklearn.metrics import roc_curve, auc
y_score = titanic_rf_exp.predict(X)
fpr, tpr, thresholds = roc_curve(y, y_score)
fig = px.area(x=fpr, y=tpr,
    title=f'ROC Curve (AUC={auc(fpr, tpr):.4f})',
    labels=dict(x='False Positive Rate', y='True Positive Rate'),
    width=700, height=500)
fig.add_shape(
    type='line', line=dict(dash='dash'),
    x0=0, x1=1, y0=0, y1=1)
fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.update_xaxes(constrain='domain')
fig.show()
Code
df = pd.DataFrame({'False Positive Rate': fpr,
        'True Positive Rate': tpr }, index=thresholds)
df.index.name = "Thresholds"
df.columns.name = "Rate"
fig_thresh = px.line(df, 
    title='TPR and FPR at every threshold', width=700, height=500)
fig_thresh.update_yaxes(scaleanchor="x", scaleratio=1)
fig_thresh.update_xaxes(range=[0, 1], constrain='domain')
fig_thresh.show()

Variable-importance Measures

Variable-importance measures are either model-specific (e.g., impurity-based importance in random forests) or model-agnostic (e.g., permutation-based importance, which dalex uses).
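The permutation idea behind `model_parts()` can be sketched with a toy regression (the model, data, and loss below are illustrative assumptions, not dalex internals): a variable is important if shuffling its values degrades the model's loss.

```python
import random

random.seed(1)

def predict(rows):
    # toy model: depends on x, ignores z entirely
    return [3 * r['x'] for r in rows]

rows = [{'x': random.random(), 'z': random.random()} for _ in range(200)]
y = [3 * r['x'] for r in rows]

def loss(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

base = loss(y, predict(rows))
importance = {}
for var in ['x', 'z']:
    shuffled = [r[var] for r in rows]
    random.shuffle(shuffled)
    permuted = [{**r, var: s} for r, s in zip(rows, shuffled)]
    # drop-out loss minus baseline loss: increase caused by shuffling var
    importance[var] = loss(y, predict(permuted)) - base
```

Shuffling `x` destroys the signal and raises the loss, while shuffling `z` changes nothing, mirroring the dropout_loss column in the dalex output above.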

Code
import dalex as dx
titanic_rf_exp = dx.Explainer(titanic_rf, X, y, 
                  label = "Titanic RF Pipeline")
Preparation of a new explainer is initiated

  -> data              : 2207 rows 7 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 2207 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : Titanic RF Pipeline
  -> predict function  : <function yhat_proba_default at 0x7fd2d838a8b0> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.17, mean = 0.321, max = 0.901
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.834, mean = 0.0008, max = 0.827
  -> model_info        : package sklearn

A new explainer has been created!
Code
mp_rf = titanic_rf_exp.model_parts()
mp_rf.result
variable dropout_loss label
0 _full_model_ 0.194832 Titanic RF Pipeline
1 sibsp 0.197510 Titanic RF Pipeline
2 parch 0.198713 Titanic RF Pipeline
3 embarked 0.198997 Titanic RF Pipeline
4 fare 0.203987 Titanic RF Pipeline
5 age 0.208972 Titanic RF Pipeline
6 class 0.263713 Titanic RF Pipeline
7 gender 0.357970 Titanic RF Pipeline
8 _baseline_ 0.489277 Titanic RF Pipeline
Code
mp_rf.plot()
Code
vi_grouped = titanic_rf_exp.model_parts(
                variable_groups={'personal': ['gender', 'age', 
                                              'sibsp', 'parch'],
                                   'wealth': ['class', 'fare']})
vi_grouped.result
variable dropout_loss label
0 _full_model_ 0.199035 Titanic RF Pipeline
1 wealth 0.267039 Titanic RF Pipeline
2 personal 0.390928 Titanic RF Pipeline
3 _baseline_ 0.508033 Titanic RF Pipeline
Code
vi_grouped.plot()

Partial-dependence profiles

Code
import dalex as dx
titanic_rf_exp = dx.Explainer(titanic_rf, X, y, 
                  label = "Titanic RF Pipeline")
Preparation of a new explainer is initiated

  -> data              : 2207 rows 7 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 2207 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : Titanic RF Pipeline
  -> predict function  : <function yhat_proba_default at 0x7fd2d838a8b0> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.17, mean = 0.321, max = 0.901
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.834, mean = 0.0008, max = 0.827
  -> model_info        : package sklearn

A new explainer has been created!
Code
pd_rf = titanic_rf_exp.model_profile(variables = ['age', 'fare'])
pd_rf.result
_vname_ _label_ _x_ _yhat_ _ids_
0 age Titanic RF Pipeline 0.166667 0.414360 0
1 age Titanic RF Pipeline 0.905000 0.416245 0
2 age Titanic RF Pipeline 1.643333 0.412445 0
3 age Titanic RF Pipeline 2.381667 0.412445 0
4 age Titanic RF Pipeline 3.120000 0.413758 0
... ... ... ... ... ...
197 fare Titanic RF Pipeline 491.578272 0.364379 0
198 fare Titanic RF Pipeline 496.698879 0.364379 0
199 fare Titanic RF Pipeline 501.819486 0.364379 0
200 fare Titanic RF Pipeline 506.940093 0.364379 0
201 fare Titanic RF Pipeline 512.060700 0.364379 0

202 rows × 5 columns

Code
pd_rf.plot()
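A partial-dependence profile is just the average of the ceteris-paribus profiles over the whole dataset. A toy sketch of that averaging (model, data, and grid are illustrative assumptions):

```python
# PD profile: for each grid value g, set the variable to g in every
# observation and average the predictions.
def model(r):
    return r['x'] ** 2 + r['z']

data = [{'x': 1, 'z': 0}, {'x': 2, 'z': 1}, {'x': 3, 'z': 2}]
grid = [0, 1, 2, 3]

pd_profile = {g: sum(model({**r, 'x': g}) for r in data) / len(data)
              for g in grid}
# here pd_profile[g] = g**2 + mean(z) = g**2 + 1
```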
Code
pd_rf.plot(geom = 'profiles')
Code
pd_rf = titanic_rf_exp.model_profile( variable_type = 'categorical')
pd_rf.plot(variables = ['gender', 'class'])
Code
pd_rf = titanic_rf_exp.model_profile(groups = 'gender', 
                                  variables = ['age', 'fare'])
pd_rf.plot()
Code
import dalex as dx
titanic_lr_exp = dx.Explainer(titanic_lr, X, y, 
                  label = "Titanic LR Pipeline")
Preparation of a new explainer is initiated

  -> data              : 2207 rows 7 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 2207 values
  -> model_class       : sklearn.linear_model._logistic.LogisticRegression (default)
  -> label             : Titanic LR Pipeline
  -> predict function  : <function yhat_proba_default at 0x7fd2d838a8b0> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.009, mean = 0.322, max = 0.97
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.96, mean = -5.83e-07, max = 0.964
  -> model_info        : package sklearn

A new explainer has been created!
Code
pdp_rf = titanic_rf_exp.model_profile()
pdp_lr = titanic_lr_exp.model_profile()
Code
pdp_rf.plot(pdp_lr, variables = ['age', 'fare'])

Local-dependence and accumulated-local profiles

Code
import dalex as dx
titanic_rf_exp = dx.Explainer(titanic_rf, X, y, 
                    label = "Titanic RF Pipeline")
Preparation of a new explainer is initiated

  -> data              : 2207 rows 7 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 2207 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : Titanic RF Pipeline
  -> predict function  : <function yhat_proba_default at 0x7fd2d838a8b0> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.17, mean = 0.321, max = 0.901
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.834, mean = 0.0008, max = 0.827
  -> model_info        : package sklearn

A new explainer has been created!
Code
ld_rf = titanic_rf_exp.model_profile(type = 'conditional')
ld_rf.result['_label_'] = 'LD profiles'
ld_rf.result
_vname_ _label_ _x_ _yhat_ _ids_
0 age LD profiles 0.166667 0.409622 0
1 age LD profiles 0.905000 0.410774 0
2 age LD profiles 1.643333 0.406121 0
3 age LD profiles 2.381667 0.405469 0
4 age LD profiles 3.120000 0.406072 0
... ... ... ... ... ...
96 sibsp LD profiles 7.680000 0.370785 0
97 sibsp LD profiles 7.760000 0.371534 0
98 sibsp LD profiles 7.840000 0.372242 0
99 sibsp LD profiles 7.920000 0.372909 0
100 sibsp LD profiles 8.000000 0.373537 0

404 rows × 5 columns

Code
ld_rf.plot(variables = ['age', 'fare'])
Code
al_rf = titanic_rf_exp.model_profile(type = 'accumulated')
al_rf.result['_label_'] = 'AL profiles'
Code
al_rf.plot(ld_rf, variables = ['age', 'fare'])
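The accumulation idea behind accumulated-local (AL) profiles can be sketched with a toy linear model (bins, data, and names below are illustrative assumptions, not dalex internals): within each bin of the variable, average the *local* change in the prediction using only the observations that fall in that bin, then accumulate those averages across bins.

```python
# ALE sketch: accumulate average local prediction changes within bins.
def model(r):
    return r['x'] + 2 * r['z']

data = [{'x': 0.1, 'z': 5}, {'x': 0.6, 'z': 1},
        {'x': 1.2, 'z': 0}, {'x': 1.8, 'z': 3}]
edges = [0.0, 1.0, 2.0]           # two bins: [0, 1) and [1, 2)

ale, level = [], 0.0
for lo, hi in zip(edges, edges[1:]):
    in_bin = [r for r in data if lo <= r['x'] < hi]
    # local effect: move x across the bin, keep everything else as observed
    diffs = [model({**r, 'x': hi}) - model({**r, 'x': lo}) for r in in_bin]
    level += sum(diffs) / len(diffs)
    ale.append(level)
```

Because only local differences are used, the correlated variable `z` never leaks into the profile, which is the advantage of AL profiles over local-dependence profiles when explanatory variables are correlated.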

Residual-diagnostics plots

Code
import dalex as dx
titanic_rf_exp = dx.Explainer(titanic_rf, X, y, 
                    label = "Titanic RF Pipeline")
Preparation of a new explainer is initiated

  -> data              : 2207 rows 7 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 2207 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : Titanic RF Pipeline
  -> predict function  : <function yhat_proba_default at 0x7fd2d838a8b0> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.17, mean = 0.321, max = 0.901
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.834, mean = 0.0008, max = 0.827
  -> model_info        : package sklearn

A new explainer has been created!
Code
md_rf = titanic_rf_exp.model_diagnostics()
md_rf.result
gender age class embarked fare sibsp parch y y_hat residuals abs_residuals label ids
0 male 42.0 3rd Southampton 7.11 0 0 0 0.189694 -0.189694 0.189694 Titanic RF Pipeline 1
1 male 13.0 3rd Southampton 20.05 0 2 0 0.255173 -0.255173 0.255173 Titanic RF Pipeline 2
2 male 16.0 3rd Southampton 20.05 1 1 0 0.237100 -0.237100 0.237100 Titanic RF Pipeline 3
3 female 39.0 3rd Southampton 20.05 1 1 1 0.555217 0.444783 0.444783 Titanic RF Pipeline 4
4 female 16.0 3rd Southampton 7.13 0 0 1 0.538864 0.461136 0.461136 Titanic RF Pipeline 5
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2202 male 41.0 deck crew Belfast 0.00 0 0 1 0.360616 0.639384 0.639384 Titanic RF Pipeline 2203
2203 male 40.0 victualling crew Southampton 0.00 0 0 1 0.197225 0.802775 0.802775 Titanic RF Pipeline 2204
2204 male 32.0 engineering crew Southampton 0.00 0 0 0 0.200158 -0.200158 0.200158 Titanic RF Pipeline 2205
2205 male 20.0 restaurant staff Southampton 0.00 0 0 0 0.171052 -0.171052 0.171052 Titanic RF Pipeline 2206
2206 male 26.0 restaurant staff Southampton 0.00 0 0 0 0.172724 -0.172724 0.172724 Titanic RF Pipeline 2207

2207 rows × 13 columns

Code
md_rf.plot()
Code
md_rf.plot(variable = "ids", yvariable = "abs_residuals")